Benchmarking V2: framework impl#40486
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

cc @ydshieh maybe? not sure who to ping for daily benchmark runs
@McPatate would be a better person :-)
McPatate
left a comment
Did a first pass, will continue tomorrow.
Awesome work! 🔥
# Run with custom parameters
python run_benchmarks.py \
    --warmup-iterations 5 \

"measurement_iterations": 5
}
},
"measurements": {
Would it make sense to have individual events (or aggregates over small time buckets) to be able to plot the data on top of having variance / means?
I'm all for doing this in a later version. This is fine as a first step, but we can definitely stat nerd the thing.
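Not for this PR, but a rough sketch of how raw per-iteration samples could live next to the aggregates (the class and field names here are made up for illustration, assuming Python 3.9+):

from dataclasses import dataclass, field

@dataclass
class MeasurementSeries:
    # raw per-iteration values in seconds, so downstream plotting / re-bucketing stays possible
    samples_seconds: list[float] = field(default_factory=list)
    mean: float = 0.0
    std: float = 0.0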
benchmark_v2/run_benchmarks.py
Outdated
format='[%(levelname)s - %(asctime)s] %(name)s: %(message)s',
handlers=[
    logging.StreamHandler(sys.stdout),
    logging.FileHandler(f'benchmark_run_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log')
Not sure we really need a file logger, maybe enable this with a cmd line arg?
It proved useful during development; I'll put it behind a cmd flag, disabled by default.
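For reference, a minimal sketch of what the opt-in file logger could look like (the --log-to-file flag name is hypothetical, not necessarily what lands in the PR):

import argparse
import logging
import sys
from datetime import datetime

parser = argparse.ArgumentParser()
parser.add_argument("--log-to-file", action="store_true", help="also write logs to a timestamped file")
args = parser.parse_args()

handlers = [logging.StreamHandler(sys.stdout)]
if args.log_to_file:
    # only create the file handler when explicitly requested
    handlers.append(logging.FileHandler(f'benchmark_run_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'))

logging.basicConfig(
    format='[%(levelname)s - %(asctime)s] %(name)s: %(message)s',
    handlers=handlers,
)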
benchmark_v2/run_benchmarks.py
Outdated
    logger: logging.Logger
) -> str:
    """Generate a summary report of all benchmark runs."""
    summary_file = os.path.join(output_dir, "benchmark_summary.json")
Perhaps we should add a timestamp to the file name so multiple runs won't overwrite the existing content.
Also, not sure a file is needed, stdout is ok imo
On a second thought, I'm not even sure this file is useful.
In the end I kept it, with a timestamp added to the file name.
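A minimal sketch of the timestamped name (output_dir stands in for the surrounding function's argument):

import os
from datetime import datetime

output_dir = "benchmark_results"  # placeholder for the function argument
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
summary_file = os.path.join(output_dir, f"benchmark_summary_{timestamp}.json")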
import torch

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
Is this necessary now that we use xet for most repos?
Not really, removing it.
benchmark_v2/benchmark_framework.py
Outdated
import torch


class CUDATimer:
- class CUDATimer:
+ class GPUTimer:
perhaps? CUDATimer sounds a bit narrow given CUDA is optional
Or something like ArchAwareTimer idk 😄
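To make the idea concrete, a hedged sketch of a device-agnostic timer (name and API are illustrative, not what the PR ships):

import time

import torch


class GPUTimer:
    # synchronizes CUDA when it is available, otherwise falls back to plain wall-clock timing
    def __enter__(self):
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        self.elapsed_seconds = time.perf_counter() - self._start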
benchmark_v2/benchmark_framework.py
Outdated
    time_to_first_token: Optional[float] = None
    latency: float = 0.0
I think it could be neat to have the unit in the name like so:
- time_to_first_token: Optional[float] = None
- latency: float = 0.0
+ time_to_first_token_seconds: Optional[float] = None
+ latency_seconds: float = 0.0
for clarity, like below
or at least a docstring for each parameter
    time_to_first_token: Optional[float] = None
    latency: float = 0.0
    tokens_per_second: Optional[float] = None
    time_per_output_token_seconds: Optional[float] = None
Isn't this the same as latency?
Ok just googled what everything is, I think ITL (inter-token latency) is clearer than TPOT, but from what I can see it looks like TPOT is more widely used.
Do you think it's ok for us to use ITL instead?
I'm fine with ITL. Personally, I'm a tok/sec guy, but we also have that. :)
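Putting the two threads together, a sketch of what unit-suffixed fields with ITL naming could look like (field renames are illustrative only):

from dataclasses import dataclass
from typing import Optional

@dataclass
class TimingResult:
    latency_seconds: float = 0.0                          # end-to-end generation latency
    time_to_first_token_seconds: Optional[float] = None   # TTFT
    inter_token_latency_seconds: Optional[float] = None   # ITL, a.k.a. TPOT
    tokens_per_second: Optional[float] = None              # throughput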
@dataclass
class TimingResult:
Is this for one event or is this storing averages?
This is for storing the results of a complete benchmark scenario. This is the data that gets serialized into the JSON output. (and this is where we could add the time-series data later)
In that case perhaps we should add _mean[_] to field names of this class? I assume the values are means.
Pardon, late evening brainfart. TimingResult represents the data point of a benchmark iteration, while BenchmarkStatistics holds the derived stats.
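To make the distinction explicit, a small sketch under the assumption that from_measurements derives mean/std (fields trimmed for brevity, not the actual class bodies):

from dataclasses import dataclass
from statistics import mean, stdev

@dataclass
class TimingResult:
    # data point of one benchmark iteration
    latency_seconds: float

@dataclass
class BenchmarkStatistics:
    # stats derived over many TimingResults
    name: str
    mean: float
    std: float

    @classmethod
    def from_measurements(cls, name, measurements):
        std = stdev(measurements) if len(measurements) > 1 else 0.0
        return cls(name=name, mean=mean(measurements), std=std)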
benchmark_v2/benchmark_framework.py
Outdated
    def __init__(self, sample_interval: float = 0.1, logger: logging.Logger = None):
        self.sample_interval = sample_interval
        self.logger = logger or logging.getLogger(__name__)
        self.monitoring = False
benchmark_v2/benchmark_framework.py
Outdated
        return self._default_prompt

    @property
    def model_type(self) -> str:
What's the purpose of model_type?
Leftover from another iteration, removed it. :)
benchmark_v2/benchmark_framework.py
Outdated
        if ttft_measurements:
            ttft_stats = BenchmarkStatistics.from_measurements("time_to_first_token", ttft_measurements)
            scenario_results["measurements"]["time_to_first_token"] = asdict(ttft_stats)
- scenario_results["measurements"]["time_to_first_token"] = asdict(ttft_stats)
+ scenario_results["measurements"]["time_to_first_token_seconds"] = asdict(ttft_stats)
benchmark_v2/benchmark_framework.py
Outdated
        if tokens_per_sec_measurements:
            self.logger.info(f"Throughput: {tps_stats.mean:.2f}±{tps_stats.std:.2f} tokens/sec (mean±std)")
        if tpot_measurements:
            self.logger.info(f"TPOT: {tpot_stats.mean:.4f}±{tpot_stats.std:.4f}s/token (mean±std)")
| logger.info(f"Created {len(scenarios)} benchmark scenarios") | ||
|
|
||
| # Create runner and execute benchmarks | ||
| runner = BenchmarkRunner(logger, output_dir) |
Will there be cases where we run multiple Benchmarks with a single BenchmarkRunner?
BenchmarkRunner will always run the benchmark for a single model, but it runs all the scenarios for that model.
benchmark_v2/benchmark_framework.py
Outdated
import torch


class WithGPU(TypedDict):
I didn't necessarily mean to take my suggestion literally, it was the essence of it that was important 😄
- class WithGPU(TypedDict):
+ class GPUMetrics(TypedDict):
may be better for this one
GPUMetrics sounds a bit better, yes. Done.
    gpu_monitoring_status: str


class NoGPU(TypedDict):
That one I don't know, this should be good enough
Kept this one, seems to capture the essence well. :D
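For completeness, a sketch of the two shapes after the rename (keys beyond gpu_monitoring_status are illustrative, not the PR's actual fields):

from typing import TypedDict

class GPUMetrics(TypedDict):
    gpu_monitoring_status: str
    gpu_utilization_mean: float   # illustrative key
    gpu_memory_used_mean: float   # illustrative key

class NoGPU(TypedDict):
    gpu_monitoring_status: str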
McPatate
left a comment
lgtm overall!
Do you have the space ready? 👀
Thank you! The space is coming with tomorrow's PR and GH Actions setup, will post it here as well :)
What does this PR do?
This PR is another iteration of reworking the benchmarking flow in Transformers. The goal is to have a flow similar to the one for Diffusers: daily reports to HF Datasets, and visualization in Gradio Spaces.
This PR focuses on the framework fundamentals; export to Datasets, GH Actions, and broader model support will come in follow-ups.
From the wishlist, the first iteration includes the following:
I put everything into a _v2 folder so we can keep the existing framework intact until this stabilizes a bit.
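As a sketch of the follow-up export step (repo id and file layout are placeholders, not the final design), a daily run could push its JSON report to a Hub dataset roughly like this:

from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="benchmark_results/benchmark_summary_20250101_000000.json",  # placeholder path
    path_in_repo="2025-01-01/benchmark_summary.json",
    repo_id="my-org/transformers-benchmarks",  # placeholder dataset repo
    repo_type="dataset",
)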
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.